In this project, I analyzed a data set on movies that is publicly available on Kaggle at the link below: https://www.kaggle.com/tmdb/tmdb-movie-metadata
The data set contains 4,803 rows of data and 20 columns. It contains information on a wide range of movies such as their title, revenue, budget, average voter rating on IMDB, and many other variables that may be of interest. For purposes of analysis, I had to create several additional columns derived from the source data.
My primary intent in this analysis was to evaluate whether any variables have a significant impact on a movie’s success. Depending on how compelling my findings are, certain variables could be used in the predictive modeling of a movie’s success.
The term “success” has many interpretations. The metrics I chose to focus on were profit from ticket sales and the average voter rating from user reviews on IMDB.com.
For the purposes of analysis, I had to make the following transformations to the source data set.
The analysis and visualizations below are for preliminary interpretation of the data. The immediate intention was to spot any interesting trends when comparing a few variables at a time. As I was doing this, I also wanted to gauge whether the dataset should be filtered down at all to minimize outlier data points. Lastly, I was curious to see which variable(s) had the highest correlation to high profits and voter ratings.
The box-and-whisker plot below gives a sense of the distribution of normalized movie budgets by release year. This plot type shows the quartile summaries and outlier data points by year. It appears that each release year had at least a few outliers. The years 2005-2015 in particular seemed to have an increasing number of outlier “big budget” films.
The scatter plot below adds an additional variable dimension (bubble size and color) that is based on the normalized profit of a movie. The data points are plotted on axes of release year vs. normalized budget.
The plot below is similar, except that bubble size and color are based on the average voter rating. I filtered out movies that had < 30 total votes for the average voter rating.
The scatter plot below replaces the release year x-axis with average voter rating. Based on this plot, there doesn’t seem to be much correlation between voter rating and normalized budget per movie. This will be explored in more detail later on.
Much of my analysis below is driven by my original question of “What makes a movie successful?”. The two key metrics I’ve chosen to measure success are normalized profit, and average voter rating.
The correlation matrix below helps highlight which variables have the highest correlation to profit and voter rating.
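As a rough illustration of what a correlation matrix contains, here is a minimal sketch in plain Python. The column names and the values are made up for the example and are not taken from the data set; the original analysis may have used a different tool to compute the matrix.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Map every pair of column names to its Pearson correlation."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up illustrative values (NOT from the TMDB data set).
# The column names are hypothetical stand-ins for the derived columns.
movies = {
    "norm_budget": [10.0, 50.0, 120.0, 200.0, 30.0],
    "norm_profit": [5.0, 80.0, 150.0, 420.0, -10.0],
    "vote_average": [6.1, 6.8, 5.9, 7.4, 6.5],
}

matrix = correlation_matrix(movies)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

Each cell ranges from -1 to 1, and the diagonal is always 1 because every variable is perfectly correlated with itself.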
It appears that average voter rating, movie runtime, and particularly normalized budget have some correlation to normalized profit. Therefore, I decided to create a linear model that projects the impact of these combined variables on normalized profit.
For reference, the standard equation for a multiple linear regression model is: response = intercept + slope1(variable1) + slope2(variable2) + … + error term, or in symbols $Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \varepsilon$.
To determine the most suitable regression model for predicting the profit of a movie, I had to fit several regression models with different combinations of predictor variables. I also had to test the statistical significance of each regression model to see whether it would be applicable to all movies (not just this sample data set). I used a combination of F-tests and t-tests for this formal hypothesis testing. There were two regression models I felt were most suitable for further analysis.
The new linear regression model I’ve calculated below only includes budget and voter average as predictors of profit: predicted normalized profit = -408.4 + 1.98(norm_budget) + 64.03(vote_average). A 3D plot for this is below. It has an
This model is likely not a good predictor of movie profit because the R-squared value is 0.35. This means that only 35% of the variability in profit can be explained by this model. Most people would not trust predictions from a model with such low explanatory power.
Within this multivariate regression model, I wanted to see whether its significance (established by the F-test) can be attributed to both or just one of these variables. This is determined by separately running a t-test on both the budget and voter average variables. When running t-tests on each variable, I’m essentially testing the “fit” of a linear model for one variable at a time.
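To make the R-squared figure more concrete, here is a minimal sketch of how the two-predictor equation produces a prediction and how R-squared compares predictions to actual values. The coefficients are the ones quoted in the text; the function names and the three sample movies are made up for illustration, and the units of the normalized columns are an assumption.

```python
def predict_profit(norm_budget, vote_average):
    """Two-predictor model using the coefficients quoted in the text."""
    return -408.4 + 1.98 * norm_budget + 64.03 * vote_average


def r_squared(actual, predicted):
    """Share of the variance in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up sample movies (NOT from the data set), in whatever units
# the normalized budget and profit columns use.
budgets = [20.0, 60.0, 150.0]
votes = [6.0, 7.0, 6.5]
actual_profit = [30.0, 200.0, 250.0]

predicted = [predict_profit(b, v) for b, v in zip(budgets, votes)]
print("predictions:", [round(p, 2) for p in predicted])
print("R-squared on the sample:", round(r_squared(actual_profit, predicted), 3))
```

An R-squared of 1 means the predictions match the actual values exactly, while 0 means the model does no better than always predicting the mean profit.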
I determined that normalized budget was the best single predictor of normalized profit. The linear regression equation for this model is: -6.82 + 1.98x, where x is the normalized budget. The 99% confidence interval for the slope is 1.98 +/- 2.577*0.0619, or (1.82, 2.14). The model is plotted against the scatter plot of data points below.
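The confidence interval arithmetic can be checked with a few lines of Python. The slope, standard error, and critical value are the figures quoted in the text; the comparison against the standard normal distribution is only an added sanity check.

```python
from statistics import NormalDist

slope = 1.98        # fitted slope quoted in the text
std_error = 0.0619  # standard error of the slope quoted in the text
critical = 2.577    # critical value used in the text

margin = critical * std_error
ci_low, ci_high = slope - margin, slope + margin
print(f"99% CI for the slope: ({ci_low:.2f}, {ci_high:.2f})")

# For comparison: the two-sided 99% critical value of the standard normal,
# which is very close to the value used above for a sample this large.
z_99 = NormalDist().inv_cdf(0.995)
print(f"standard normal critical value: {z_99:.3f}")
```

Because the whole interval lies above zero, the slope for normalized budget is significantly different from zero at the 1% level.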
Even though this simple linear regression model had a slightly lower R-squared value of 0.28, and therefore lost some value as a predictive model for profit, it showed that normalized budget had the highest correlation to normalized profit compared to all other numeric variables.
To further explore this correlation, I decided to perform some regression diagnostics. This would help identify any outliers that may have distorted the model.
For every model, we expect some residual error. This refers to how far the actual data points vary from the expected predicted value from a linear model. When evaluating the residual errors, two key things to look for are:
Is the degree of residual error constant across the entire regression line? To rephrase this differently, is the variation of the response variable (profit) consistent around the regression line?
Are the residuals around the regression line normally distributed?
As seen above, the residual points are very inconsistent around the regression line. The vast majority of movies with budgets < 150 million have increasingly negative residuals. Even if the residuals were constant, there are also numerous outlier points (particularly for high-budget movies) that likely have strong influence on the regression model. A regression model would have to be re-run with these data points removed. Another option would be to run a regression model limited to only movies in the 150-250 million dollar range. The residuals look more constant there.
I additionally looked to see whether there was a normal distribution of the residuals. With the exception of the outlier points, the overall distribution looks normal. However, looking at this plot alone wouldn’t illustrate whether the distribution of positive/negative residuals is constant throughout the entire regression line.
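The two residual checks described above can be sketched numerically as well as visually. The following is a minimal illustration in plain Python with made-up budget and profit values; the helper names are hypothetical, and comparing the spread of residuals between the low-budget and high-budget halves is a crude stand-in for reading a residual plot, not a formal test.

```python
from statistics import mean, pstdev


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope


def residuals(xs, ys):
    """Actual value minus fitted value for every data point."""
    intercept, slope = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def spread_by_half(xs, res):
    """Standard deviation of residuals in the low-x and high-x halves."""
    ordered = [r for _, r in sorted(zip(xs, res))]
    half = len(ordered) // 2
    return pstdev(ordered[:half]), pstdev(ordered[half:])


def skewness(values):
    """Sample skewness; values near 0 suggest a symmetric distribution."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


# Made-up values (NOT from the data set) where the scatter grows with budget.
budgets = [10.0, 20.0, 30.0, 40.0, 150.0, 180.0, 210.0, 240.0]
profits = [12.0, 18.0, 35.0, 38.0, 100.0, 400.0, 150.0, 600.0]

res = residuals(budgets, profits)
low_spread, high_spread = spread_by_half(budgets, res)
print("residual spread, low vs high budgets:", round(low_spread, 1), round(high_spread, 1))
print("residual skewness:", round(skewness(res), 2))
```

A large gap between the two spreads points to non-constant variance, which is the same pattern the residual plot shows for high-budget movies.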
For the sake of having constant residuals and removing outliers, I considered recalculating a linear regression model for only movie budgets between 150-250 million and less than 1 billion in profit. The residual plot for this is below. Even when filtering on this range of data points, you can see that a linear regression model is not a good fit. Too many of the data points have a high residual and are far from the regression line. The R-squared coefficient is extremely low at 0.14.
In conclusion, I was unable to find any type of linear regression model that would serve as a good predictor of normalized movie profit. Budget had the highest correlation of all variables, but was still not a very strong predictor.
For more in-depth analysis of how and which other linear regression models were tested, see my additional [page](https://cdn.rawgit.com/omshapira/Movie-Analysis/a7abba15/Movie_Analysis_Code%2BObservations.html#linear-regression-modeling) with code and additional commentary.
In addition to performing linear regression testing on the numeric variables (runtime, budget, voter rating), I was curious whether there were any categorical non-numeric variables that had an impact on normalized profit. The first one that I chose to analyze was the month of the year when a movie was released.
To see whether there is a seasonal impact on the profit of a movie, I altered my dataset to average movie profits by month of year. My hypothesis was that summer movies would have higher profits, since more people are likely to go out and see movies in the summertime.
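The monthly averaging step could look roughly like the sketch below. It assumes each record has an ISO-formatted `release_date` string (YYYY-MM-DD) and a `norm_profit` field; both field names and all sample values are assumptions made for the example.

```python
from collections import defaultdict
from datetime import date


def mean_profit_by_month(movies):
    """Average the normalized profit of movies by calendar month of release."""
    profits = defaultdict(list)
    for movie in movies:
        month = date.fromisoformat(movie["release_date"]).month
        profits[month].append(movie["norm_profit"])
    return {month: sum(v) / len(v) for month, v in sorted(profits.items())}


# Made-up records (NOT from the data set); field names are hypothetical.
movies_sample = [
    {"release_date": "2010-07-16", "norm_profit": 100.0},
    {"release_date": "2012-07-20", "norm_profit": 300.0},
    {"release_date": "2011-12-21", "norm_profit": 50.0},
]

by_month = mean_profit_by_month(movies_sample)
print(by_month)
```

Grouping by month number pools all release years together, which is what a seasonal comparison needs.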
The first graphic I created to analyze this was a boxplot. Due to all of the outlier data points (high-budget movies), it was a little hard to interpret this chart. While it did appear that summer movies (as expected) had higher profits, I suspected that this was due to other confounding variables such as budget.
The first ANCOVA test I performed compared the difference in profit between months while accounting for all other numeric variables (budget, runtime, and voter average). This ensures that the difference in profits by month isn’t driven by these confounding variables.
I then ran an ANCOVA test that compared the difference in profit between months when accounting only for budget and voter average (runtime turned out not to be a significant factor). When tested at an alpha level of 0.05 (a 95% confidence level), the global F-test showed that there was a statistically significant difference in profit across release months. This means that at least two of the months within this analysis differ significantly from each other in profit when controlling for the confounding variables.
I continued to calculate the least-squares means for the normalized profits by release month. I visualized the findings below.
The plot above compares the least-squares means (indicated by the blue dots) of normalized movie profits when controlling for the confounding variables normalized budget and voter average. Based on this sample of data, the mean profit of some months (like June-August) was higher than that of others (November-December), even if the movies within those months had exactly the same budget and voter rating. The tick marks near the blue dots reflect the 95% confidence interval for the mean normalized profit.
I additionally plotted the LS means for profit when only accounting for the movie budget variable. This plot produced similar results.
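For the single-covariate case, an LS mean can be computed by hand: take each month's raw mean profit and shift it along the pooled within-group slope to where it would sit if that month had the overall average budget. The sketch below illustrates that textbook one-way ANCOVA adjustment with made-up numbers; it is an assumption-laden illustration, not the code used for the plots.

```python
from statistics import mean


def adjusted_means(groups):
    """One-covariate ANCOVA adjusted means.

    groups maps a label to a list of (covariate, response) pairs.
    Each group mean is evaluated at the grand mean of the covariate,
    using the pooled within-group slope.
    """
    grand_x = mean(x for rows in groups.values() for x, _ in rows)
    sxy = sxx = 0.0
    group_stats = {}
    for label, rows in groups.items():
        gx = mean(x for x, _ in rows)
        gy = mean(y for _, y in rows)
        group_stats[label] = (gx, gy)
        sxy += sum((x - gx) * (y - gy) for x, y in rows)
        sxx += sum((x - gx) ** 2 for x, _ in rows)
    slope = sxy / sxx  # pooled within-group slope of response on covariate
    return {label: gy - slope * (gx - grand_x) for label, (gx, gy) in group_stats.items()}


# Made-up (budget, profit) pairs, NOT from the data set.
# June has big budgets and big raw profits; August has small ones.
sample = {
    "June": [(100.0, 210.0), (200.0, 410.0)],
    "August": [(10.0, 50.0), (30.0, 90.0)],
}

adj = adjusted_means(sample)
print(adj)
```

In this toy sample June has the higher raw mean profit (310 vs. 70), but once both months are evaluated at the same budget, August comes out ahead (200 vs. 180). That is the kind of reversal a budget adjustment can produce.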
These results were somewhat surprising for several reasons.
August shows the highest LS mean. I was expecting June/July to have the highest LS means, since movie studios typically release their big blockbuster movies at the beginning of summer. However, these same blockbusters do tend to have higher budgets, which would reduce their LS mean value. As seen from the initial bar chart above, the average budget for movies in August is much lower than June/July (explaining the higher LS mean value). One could infer that many people are still willing to see movies released in August (perhaps a final rush before school starts).
November/December have surprisingly low LS mean profits. The Golden Globes and Academy Awards (held in January and March, respectively) require that a movie be released prior to January 1st to be eligible for nomination. For that reason, movie studios typically try to release their most critically acclaimed movies late in the year, so that they are still fresh in viewers’ minds for the upcoming “Awards season”. Though these “critically-acclaimed” movies aren’t necessarily high profit, I was expecting these months to have higher LS means than other winter/early spring months. The relatively high budgets of movies in November/December, paired with the fact that people don’t go out as much in the winter, are likely the culprit for the low LS mean values.
For movie studios focused on critic/user reviews more than profit, I was also curious to plot the LS means for voter average when controlling for budget. The plot above shows that December does have a relatively high LS mean average rating (due to movies released for “Awards season”). However, I was surprised that December voter ratings were not any higher than certain late spring and summer months. Perhaps people rate movies higher in the summer when they’re in a better mood? Or perhaps the movies released in winter are not necessarily better (with the possible exception of a select few nominees).
In theory, one could apply this model for predictive purposes about the optimal month to release a movie. However, one should be cognizant that outliers may have a significant impact on these results. There also may be other categorical variables that have an impact on profit. I was curious to see whether movie genre had any impact on certain specific months of the year. More on this below.
To see whether movie genre had an impact on normalized movie profit and average voter rating, I produced box and whiskers plots for some preliminary analysis. Once again, this type of plot was a bit difficult to evaluate because there were many outliers and categories being compared. For the rest of my analysis, I filtered on genres that had > 30 movies within the data set.
*One main caveat of my genre analysis is that I only extracted the first listed genre for each movie. If a movie had multiple listed genres, only the first one was extracted with the assumption that this represents the main genre classification.
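A minimal sketch of the first-genre extraction and the genre-size filter is shown below. It assumes the `genres` column stores a JSON list of objects with a `name` key, which is how the Kaggle file appears to encode it; the helper names and sample rows are hypothetical.

```python
import json
from collections import Counter


def first_genre(genres_json):
    """Return the first listed genre name, or None if no genre is listed."""
    genres = json.loads(genres_json)
    return genres[0]["name"] if genres else None


def frequent_genres(primary_genres, min_count):
    """Genres that appear as the primary genre of more than min_count movies."""
    counts = Counter(g for g in primary_genres if g is not None)
    return {genre for genre, n in counts.items() if n > min_count}


# Sample strings in the assumed format (NOT actual rows from the data set).
rows = [
    '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
    '[{"id": 18, "name": "Drama"}]',
    '[]',
]

primary = [first_genre(row) for row in rows]
print(primary)

# In the analysis the threshold was 30; a tiny threshold is used here
# only because the sample is tiny.
print(frequent_genres(primary + ["Drama", "Drama"], min_count=2))
```

Taking only the first entry is what drives the caveat above: a movie listed as Action and Adventure is counted solely as Action.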
Now that I’ve gotten a sense for the data categorized by genre, I performed a one-way ANCOVA, similar to what was done for analyzing release months. Before I reassessed the impact of release month on profit when controlling for movie genre, I first wanted to test whether there was a statistically significant difference between the mean profits of different genres.
The F-test above shows that there is a difference between normalized profits across genres when controlling for budget and voter average. I additionally plotted the LS means of profit between genres below.
As can be seen, Family and Adventure movies have the highest LS mean for normalized profit (when controlling for budget and voter average).
I additionally repeated the same type of one-way ANCOVA testing to evaluate the impact of genre on voter rating.
First, I used the F-test again to determine whether there was a statistical difference in voter rating across all genres.
## [1] 1.792379
Then, I calculated the LS means for voter rating by each genre of interest when controlling for budget. Once again, Drama and Adventure movies seemed to be the most successful.
Now that I’ve determined genre to be a statistically significant variable that impacts normalized profit and voter rating, I decided to include it in the ANCOVA testing I had previously done for movie release month.
To simplify my analysis, I chose to only focus on comparing two different months at a time for only two genres. I expect the significance of the impact of release month and genre on profit and voter rating to be of a different magnitude with this smaller data subset.
As a first step, I performed a global F-test on a two-way ANCOVA that includes the genre and release month categorical variables.
It turns out that when release month, genre, and budget are combined in a model, genre and release month do not have a significant impact on the difference in expected profit means. Budget was by far the major driving factor in differing profits for August and September.
Meanwhile, it turns out that genre and release month together would have a significant impact on the different expected means for voter rating. Budget was not a big factor for these. As a reminder, these conclusions are only applicable to subsetted data that is limited to movies released in August and September and with the genres of comedy or drama.
Before it is safe to infer conclusions from this two-way ANCOVA model, it is also important to test for any interaction between the two categorical variables, release month and genre. An interaction test checks whether the effect of one categorical variable on the response is consistent across all groups of the other categorical variable.
The interaction plot above shows that there appears to be very little interaction between the genre and release month variables - the two regression lines are almost parallel. If the two lines had very different slopes or intersected, it would imply the effect of genre would not be consistent across August and September. When formally testing for this interaction above, it returned an insignificant p-value (> 0.05). Given this lack of interaction, both the genre and release month variables are safe to include in a two-way ANCOVA model. The resulting LS Means plot is shown below.
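The idea of parallel lines can be expressed as a single number for a 2x2 design: the month-to-month change for one genre minus the month-to-month change for the other. The sketch below computes that difference-in-differences on raw cell means with made-up ratings. It does not adjust for budget and is not the formal interaction test referred to above, only an illustration of what near-parallel lines mean.

```python
from statistics import mean


def interaction_contrast(cells):
    """Difference-in-differences of cell means for a 2x2 design.

    cells maps (genre, month) to a list of response values.
    A value near 0 corresponds to parallel lines in an interaction plot.
    """
    genres = sorted({genre for genre, _ in cells})
    months = sorted({month for _, month in cells})
    if len(genres) != 2 or len(months) != 2:
        raise ValueError("expected exactly two genres and two months")
    cell_mean = {key: mean(values) for key, values in cells.items()}
    (g1, g2), (m1, m2) = genres, months
    change_g1 = cell_mean[(g1, m2)] - cell_mean[(g1, m1)]
    change_g2 = cell_mean[(g2, m2)] - cell_mean[(g2, m1)]
    return change_g1 - change_g2


# Made-up voter ratings (NOT from the data set).
ratings = {
    ("Comedy", "August"): [6.0, 6.2],
    ("Comedy", "September"): [5.8, 6.0],
    ("Drama", "August"): [6.8, 7.0],
    ("Drama", "September"): [6.6, 6.8],
}

contrast = interaction_contrast(ratings)
print(round(contrast, 3))
```

Here both genres drop by the same amount from August to September, so the contrast is essentially zero and the two lines would be parallel.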